Natural language guidance of high-fidelity text-to-speech with synthetic annotations